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Abstract 

Consider the problem of joint parameter estimation and prediction in a Markov random 
field: i.e., the model parameters are estimated on the basis of an initial set of data, and then 
the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a 
new noisy observation. Working under the restriction of limited computation, we analyze a joint 
method in which the same convex variational relaxation is used to construct an M-estimator 
for fitting parameters, and to perform approximate marginalization for the prediction step. 
The key result of this paper is that in the computation-limited setting, using an inconsistent 
parameter estimator (i.e., an estimator that returns the "wrong" model even in the infinite data 
limit) can be provably beneficial, since the resulting errors can partially compensate for errors 
made by using an approximate prediction technique. En route to this result, we analyze the 
asymptotic properties of M- estimators based on convex variational relaxations, and establish 
a Lipschitz stability property that holds for a broad class of variational methods. We show 
that joint estimation/prediction based on the reweighted sum-product algorithm substantially 
outperforms a commonly used heuristic based on ordinary sum-product. 

Keywords: Graphical model; Markov random field; belief propagation; sum-product algorithm; variational method; 
parameter estimation; pseudolikelihood; prediction error; Lipschitz stability; mixture of Gaussian. 

1 Introduction 



Graphical models such as Markov random fields (MRFs) are widely used in many application 
domains, including spatial statistics, statistical signal processing, and communication theory. A 
fundamental limitation to their practical use is the infeasibility of computing various statistical 
quantities (e.g., marginals, data likelihoods etc.); such quantities are of interest both Bayesian 
and frequentist settings. Sampling-based methods, especially those of the Markov chain Monte 
Carlo (MCMC) variety ^l^j, represent one approach to obtaining stochastic approximations to 
marginals and likelihoods. A disadvantage of sampling methods is their relatively high computa- 
tional cost. For instance, in applications with severe limits on delay and computational overhead 
(e.g., error-control coding, real-time tracking, video compression), MCMC methods are likely to 
be overly slow. It is thus of considerable interest for various application domains to consider less 
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computationally intensive methods for generating approximations to marginals, log likelihoods, and 
other relevant statistical quantities. 

Variational methods are one class of techniques that can be used to generate deterministic 
approximations in Markov random fields (MRFs). At the foundation of these methods is the fact 
that for a broad class of MRFs, the computation of the log likelihood and marginal probabilities 
can be reformulated as a convex optimization problem (see |32( I.SOj for an overview). Although this 
optimization problem is intractable to solve exactly for general MRFs, it suggests a principled route 
to obtaining approximations — namely, by relaxing the original optimization problem, and taking 
the optimal solutions to the relaxed problem as approximations to the exact values. In many cases, 
optimization of the relaxed problem can be carried out by "message-passing" algorithms, in which 
neighboring nodes in the Markov random field convey statistical information (e.g., likelihoods) by 
passing functions or vectors (referred to as messages). 

Estimating the parameters of a Markov random field from data poses another significant chal- 
lenge. A direct approach — for instance, via (regularized) maximum likelihood estimation — entails 
evaluating the cumulant generating (or log partition) function, which is computationally intractable 
for general Markov random fields. One viable option is the pseudolikelihood method [SI 13) which 
can be shown to produce consistent parameter estimates under suitable assumptions, though with 
an associated loss of statistical efficiency. Other researchers have studied algorithms for ML es- 
timation based on stochastic approximation |36| ^ , which again are consistent under appropriate 
assumptions, but can be slow to converge. 

1.1 Overview 

As illustrated in Figure the problem domain of interest in this paper is that of joint estimation 
and prediction in a Markov random field. More precisely, given samples {X^, . . . ,X^} from some 
unknown underlying model p{- ;6*), the first step is to form an estimate of the model parameters. 
Now suppose that we are given a noisy observation of a new sample path Z ~ |?( • ; 0*), and that 
we wish to form a (near)-optimal estimate of Z using the fitted model, and the noisy observation 
(denoted Y). Examples of such prediction problems include signal denoising, image interpolation, 
and decoding of error-control codes. Disregarding any issues of computational cost and speed, 
one could proceed via Route A in Figure Q — that is, one could envisage first using a standard 
technique (e.g., regularized maximum likelihood) for parameter estimation, and then carrying out 
the prediction step (which might, for instance, involve computing certain marginal probabilities) 
by Monte Carlo methods. 

This paper, in contrast, is concerned with the computation-limited setting, in which both sam- 
pling or brute force methods are overly intensive. With this motivation, a number of researchers 
have studied the use of approximate message-passing techniques, both for problems of predic- 
tion [ElinillSlESlEZllSlES] as well as for parameter estimation 1221 ESI EH] • However, despite 
their wide-spread use, the theoretical understanding of such message-passing techniques remains 
limited^, especially for parameter estimation. Consequently, it is of considerable interest to char- 
acterize and quantify the loss in performance incurred by using computationally tractable methods 
versus exact methods (i.e.. Route B versus A in Figure P). More specifically, our analysis applies to 

^The behavior of sum-product is relatively well understood in certain settings, including graphs with single cy- 
cles )33| and Gaussian models 1^ 1211 . Similarly, there has been substantial progress for graphs with high girth |lfci|. 
but much of this analysis breaks down in application to graphs with short cycles 
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variational methods that are based convex relaxations. This class includes a broad range of extant 
methods — among them the tree-reweighted sum-product algorithm [201, reweighted forms of gen- 
eralized belief propagation [35, and semidefinite relaxations j32j. Moreover, it is straightforward 
to modify other message-passing methods (e.g., expectation propagation [12]) so as to "convexify" 
them. 

At a high level, the key idea of this paper is the following: given that approximate methods can 
lead to errors at both the estimation and prediction phases, it is natural to speculate that these 
sources of error might be arranged to partially cancel one another. The theoretical analysis of this 
paper confirms this intuition: we show that with respect to end-to-end performance, it is in fact 
beneficial, even in the infinite data limit, to learn the "wrong" the model by using inconsistent 
methods for parameter estimation. En route to this result, we analyze the asymptotic properties of 
M-estimators based on convex variational relaxations, and establish a Lipschitz stability property 
that holds for a broad class of variational methods. We show that joint estimation/prediction based 
on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic 
based on ordinary sum-product. 
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Figure 1. Route A: computationally intractable combination of parameter estimation and pre- 
diction. Route B: computationally efficient combination of approximate parameter estimation and 
prediction. 

The remainder of this paper is organized as follows. Section [21 provides background on Markov 
random fields and associated variational representations, as well as the problem statement. In 
Section [SJ we introduce the notion of a convex surrogate to the cumulant generating function, 
and then illustrate this notion via the tree-reweighted Bethe approximation [29j. In Sectional we 
describe how any convex surrogate defines a particular joint scheme for parameter estimation and 
prediction. Section [S] provides results on the asymptotic behavior of the estimation step, as well as 
the stability of the prediction step. Section [HI is devoted to the derivation of performance bounds 
for joint estimation and prediction methods, with particular emphasis on the mixture-of-Gaussians 
observation model. In Section [3 we provide experimental results on the performance of a joint 
estimation/prediction method based on the tree-reweighted Bethe surrogate, and compare it to a 
heuristic method based on the ordinary belief propagation algorithm. We conclude in Section [H] 
with a summary and discussion of directions for future work. 
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2 Background and problem statement 



2.1 Multinomial Markov random fields 

Consider an undirected graph G = iV, E), consisting of a set of vertices V = {1, . . . , N} and an edge 
set E. We associate to each vertex s £ V a multinomial random variable Xg taking values in the 
set = {0, 1, . . . , m — 1}. We use the lower case letter Xs to denote particular realizations of the 
random variable Xg in the set Xg. This paper makes use of the following exponential representation 
of a pairwise Markov random field over the multinomial random vector X := {Xg, s S V}. We 
begin by defining, for each j = l,...,m — 1, the {0, l}-valued indicator function 

I,[xg] := '^"^^=3 
I otherwise 

These indicator functions can be used to define a potential function 9g{-) : Xg via 

m—l 

9g{xg) := (2) 
i=i 

where 9g = {9g-j,j = 1, ... ,m — 1} is the vector of exponential parameters associated with the 
potential. Our exclusion of the index j = is deliberate, so as to ensure that the collection of 
indicator functions <j3g{xg) := {Ij[xg], j = 1, . . . ,m — 1} remain affinely independent. In a similar 
fashion, we define for any pair (s, t) £ E the pairwise potential function 



m—l m—l 



{xg,xt) := '^'^9gt-jk^j[xg]Ik[xt\, (3) 
j=i k=i 



where we use 9gt ■= {9gt-jk^ J; ^ = 1) 2, . . . , m — 1} to denote the associated collection of exponen- 
tial parameters, and 4>gt{xg,xt) := {I j, = 1,2,... ,m — 1} for the associated set of 
sufficient statistics. 

Overall, the probability mass function of the multinomial Markov random field in exponential 
form can be written as 

p{x;9) = exp { ^ 6's(xs) + ^ 9gt {xg , Xf )-A{9)}. (4) 

seV {s,t)eE 

Here the function 

A{9) := log [ E { E 

(xg) + E 9gt{xg,xt)}\ (5) 
xi^x^ sev {s,t)eE 

is the logarithm of the normalizing constant associated with p{- ; 9). 

The collection of distributions thus defined can be viewed as a regular and minimal exponential 
family [^. In particular, the exponential parameter 9 and the vector of sufficient statistics (p are 
formed by concatenating the exponential parameters (respectively indicator functions) associated 
with each vertex and edge — viz. 

9 = {9g,s£V}U{9gt, {s,t)eE} (6a) 
(P{x) = {(l)g{xg),s£V}U{^gtixg,Xt), is,t)GE} (6b) 
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This notation allows us to write equation @ more compactly as p{x ; 6) = exp{{9, 4>{x)) — A{9)}. 
A quick calculation shows that 9 G W^, where d = N{m — 1) + \E\ [m — 1)^ is the dimension of this 
exponential family. 

The following properties of A are well-known: 

Lemma 1. (a) The function A is convex, and strictly so when the sufficient statistics are affinely 
independent. 

(h ) It is an infinitely differentiable function, with derivatives corresponding to cumulants. In par- 
ticular, for any indices a, f3 £ {1, . . . ,d}, we have 

OA d'^A 

— = Ee[0a(X)], ^^-^ = cove{<t^o.{X), (7) 
where Eg and covg denote the expectation and covariance respectively. 

We use fi £ to denote the vector of mean parameters defined element-wise by fia = ^elfpaiX)] 
for any aG{l,...,(i}. A convenient property of the sufficient statistics (j) defined in equations 
and ((2)) is that these mean parameters correspond to marginal probabilities. For instance, when 
= or a = {st;jk), we have respectively 

fj.s;j = Ee[Ij[xs]] = p{Xs = j ; 9), and (8a) 
fi,t;jk = ¥.e{lj[xs]Ik[xt]} =p{Xs=j,Xt = k;9). (8b) 



3 Construction of convex surrogates 

This section is devoted to a systematic procedure for constructing convex functions that represent 
approximations to the cumulant generating function. We begin with a quick development of an 
exact variational principle, one which is intractable to solve in general cases. (More details on this 
exact variational principle can be found in the papers [M)\ \^'2\.) Nonetheless, this exact variational 
principle is useful, in that various natural relaxations of the optimization problem can be used 
to define convex surrogates to the cumulant generating function. After a high-level description of 
such constructions in general, we then illustrate it more concretely with the particular case of the 
"convexified" Bethe entropy |29j . 



3.1 Exact variational representation 

Since ^ is a convex and continuous function (see Lemma the theory of convex duality JH] 
guarantees that it has a variational representation, given in terms of its conjugate dual function 
A* : M'^ ^ M U {+oo}, of the following form 

A{9) = sup {9'^fi- A*{^l)}. (9) 

In order to make effective use of this variational representation, it remains determine the form of 
the dual function. A useful fact is that the exponential family (jlj arises naturally as the solution 
of an entropy maximization problem. In particular, consider the set of linear constraints 

Ep[(j){X)] := p{x)(j)a{x) = fia for a = I, . . . ,d, (10) 
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where € M"^ is a set of target mean parameters. Letting V denote the set of all probability dis- 
tributions with support on , consider the constrained entropy maximization problem: maximize 
the entropy H{p) := — X^^g;^ jv logp(rE) subject to the constraints (fTU)) . 

A first question is when there any distributions p that satisfy the constraints . Accordingly, 
we define the set 

MARG0(G) := {/xeM'^|/x = Ep[0(X)] for some p G P}, (11) 

corresponding to the set of /u for which the constraint set pOj) is non-empty. For any ^ ^ MARG(^(G), 
the optimal value of the constrained maximization problem is — oo (by definition, since the problem 
is infeasible). Otherwise, it can be shown that the optimum is attained at a unique distribution in 
the exponential family, which we denote by p(- ; 9{fJ,))- Overall, these facts allow us to specify the 
conjugate dual function as follows: 

^ i-Hip{.;0{f,))) ifMGMARG^(G) ^^^^ 
1 -|-oo otherwise. 

See the technical report [SOj for more details of this dual calculation. With this form of the dual 
function, we are guaranteed that the cumulant generating function A has the following variational 
representation: 

A{d) = max {0^/.-^*(/.)}. (13) 

AfSMARG^CG) 

However, in general, solving the variational problem H13|) is intractable. This intractability should 
not be a surprise, since the cumulant generating function is intractable to compute for a general 
graphical model. The difficulty arises from two sources. First, the constraint set MARG^(G) is 
extremely difficult to characterize exactly for a general graph with cycles. For the case of a multino- 
mial Markov random field Q, it can be seen (using the Minkowski- Weyl theorem) that MARG0(G) 
is a polytope, meaning that it can be characterized by a finite number of linear constraints. The 
question, of course, is how rapidly this number of constraints grows with the number of nodes 
in the graph. Unless certain fundamental conjectures in computational complexity turn out to be 
false, this growth must be non-polynomial (see Deza and Laurent [Tj for an in-depth discussion 
of the binary case). Tree-structured graphs are a notable exception, for which the junction tree 
theory U2| guarantees that the growth is only linear in N. 

Second, the dual function A* lacks a closed-form representation for a general graph. Note in 
particular that the representation (|12)) is not explicit, since it requires solving a constrained entropy 
maximization problem in order to compute the value H{p{- ; 6{^))). Again, important exceptions 
to this rule are tree-structured graphs. Here a special case of the junction tree theory guarantees 
that any Markov random field on a tree T = {V, E{T)) can be factorized in terms of its marginals 
as follows 

Consequently, in this case, the negative entropy (and hence the dual function) can be computed 
explicitly as 

-A*{f,;T) = Y^Hsifis)- Yl ^^stM (15) 
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where Hsins) ■= - Ex. ^^(a;^) log /^^(a^s) and htifJ-st) ■= T.xs,xt f^st{xs,xt)log-j^f0fg^ are the 
singleton entropy and mutual information, respectively, associated with the node s £ V and edge 
(s,t) E E{T). For a general graph with cycles, in contrast, the dual function lacks such an explicit 
form, and is not easy to compute. 

Given these challenges, it is natural to consider approximations to A* and MARG0(G). As 
we discuss in the following section, the resulting relaxed optimization problem defines a convex 
surrogate to the cumulant generating function. 

3.2 Convex surrogates to the cumulant generating function 

We now describe a general procedure for constructing convex surrogates to the cumulant generating 
function, consisting of two main ingredients. Given the intractability of characterizing the marginal 
polytope MARG(^(G), it is natural to consider a relaxation. More specifically, let REL(^(G) be a 
convex and compact set that acts as an outer bound to MARG(j,{G). We use r to denote elements 
of REL^(G), and refer to them as pseudomarginals since they represent relaxed versions of local 
marginals. The second ingredient is to designed to sidestep the intractability of the dual function: 
in particular, let B* be a strictly convex and twice continuously differentiable approximation to 
A*. We require that the domain of B* (i.e., dom(i3*) := {r E W^\B*[t) < +00}) be contained 
within the relaxed constraint set REL(^(G). 

By combining these two approximations, we obtain a convex surrogate B to the cumulant 
generating function, specified via the solution of the following relaxed optimization problem 

B(9) := max \9^t-B*(t)}. (16) 

rGREL^{G) 

Note the parallel between this definition H16() and the variational representation of A in equa- 
tion (fTT?]) . 

The function B so defined has several desirable properties, as summarized in the following 
proposition: 

Proposition 1. Any convex surrogate B defined via ()16() has the following properties: 
(i) For each 6 E M*^, the optimum defining B is attained at a unique point t{9). 
(a) The function B is convex on W^. 
(Hi) It is differentiable on , and more specifically: 

VB{B) = T{e). (17) 

Proof, (i) By construction, the constraint set REL0(G) is compact and convex, and the function 
B* is strictly convex, so that the optimum is attained at a unique point t{9). 

(ii) Observe that B is defined by the maximum of a collection of functions linear in 9, which ensures 
that it is convex i2j. 

(iii) Finally, the function B'^t — B*{t) satisfies the hypotheses of Danskin's theorem pi, from which 
we conclude that B is differentiable with VB[9) = t{9) as claimed. □ 

Given the interpretation of t{9) as a pseudomarginal, this last property of B is analogous to the 
well-known cumulant generating property of A — namely, V^(^) = fi{9) — as specified in Lemma ^ 
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3.3 Convexified Bethe surrogate 



The following example provides a more concrete illustration of this constructive procedure, using 
a tree-based approximation to the marginal polytope, and a convexifed Bethe entropy approxi- 
mation As with the ordinary Bethe approximation IHS], the cost function and constraint set 
underlying this approximation are exact for any tree-structured Markov random field. 



Relaxed polytope: We begin by describing a relaxed version REL(^(G) of the marginal polytope 
MARG0(0). Let Tg and Tst represent a collection of singleton and pairwise pseudomarginals, 
respectively, associated with vertices and edges of a graph G. These quantities, as locally valid 
marginal distributions, must satisfy the following set of local consistency conditions: 

LOCA^(G) := {T^M.%\Y,rs{xs) = l, Y.^st{xs.xt)=T,{xs)]. (18) 

Xs Xt 

By construction, we are guaranteed the inclusion MARG0(G) C LOCAL0(G). Moreover, a special 
case of the junction tree theory guarantees that equality holds when the underlying graph is 
a tree (in particular, any r G LOCAL0(G) can be realized as the marginals of the tree-structured 
distribution of the form (|14j) ). However, the inclusion is strict for any graph with cycles; see 
Appendix for further discussion of this issue. 



Entropy approximation: We now define an entropy approximation that is finite for any 
pseudomarginal r in the relaxed set LOCAL(^(G). We begin by considering a collection {T e T} 
of spanning trees associated with the original graph. Given r G LOCAL^(G), there is — for each 
spanning tree T — a unique tree-structured distribution that has marginals and Tst on the vertex 
set V and edge set E{T) of the tree. Using equations (fTHl and (P3|) . the entropy of this tree- 
structured distribution can be computed explicitly. The convexified Bethe entropy approximation 
is based on taking a convex combination of these tree entropies, where each tree is weighted by a 
probability p{T) G [0, 1]. Doing so and expanding the sum yields 

B;{t) := Y.p{T){^Hs{Ts)- Yl ^«*(^^*)} = E PstUrst), (19) 

Tel seV {s,t)(^E{T) s&V {s,t)&E 

where pst = XIt pC^)^ ii^^ ^ ^] ^-^'^ the edge appearance probabilities defined by the distribution 
p over the tree collection. By construction, the function B* is differentiable; moreover, it can 
be shown [20] that it is strictly convex for any vector {pst} of strictly positive edge appearance 
probabilities. 



Bethe surrogate and reweighted sum-product: We use these two ingredients — the relaxation 
LOCAL^(G) of the marginal polytope, and the convexified Bethe entropy approximation (|19|) — to 
define the following convex surrogate 

Bpi0) := max {^'^r - B*(r)}. (20) 

Since the conditions of Proposition ^ are satisfied, we are guaranteed that Bp is convex and dif- 
ferentiable on M'^, and moreover that VBp[9) = t{9), where (for each 9 G W^) the quantity t{6) 
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denotes the unique optimum of problem (|2U() . Perhaps most importantly, the optimizing pseu- 
domarginals t{9) can be computed efficiently using a tree-reweighted variant of the sum-product 
message-passing algorithm |29j. This method operates by passing "messages", which in the multi- 
nomial case are simply m-vectors of non-negative numbers, along edges of the graph. We use 
Mts = {Mts{i),i = 0, . . . , m — 1} to represent the message passed from node t to node s. In the 
tree-reweighted variant, these messages are updated according to the following recursion 

^ ^ [Mst{xt)\ 

Upon convergence of the updates, the fixed point messages M* yield the unique global optimum of 
the optimization problem (|2flj) via the following equations 



Ts{xs;6) oc exp {Osixs)} n [Mus{xs)Y''% and (22a) 

uer(s) 

n [Mus{xs)Y''' n [M,s{xs)Y^' 

Tst{x.,x,;9) oc e.p{es{x.) + e,{x,) + ^-^^^^y-^ ^f^'^ ^2b) 

Pst Mst{Xt) Mts{Xs) 

Further details on these updates and their properties can be found in the paper ^H] . 



4 Joint estimation and prediction using surrogates 

We now turn to consideration of how convex surrogates, as constructed by the procedure described 
in the previous section, are useful for both approximate parameter estimation as well as prediction. 



4.1 Approximate parameter estimation 

Suppose that we are given i.i.d. samples {X^,...,X"'} from an MRF of the form (jlj, where 
the underlying true parameter 6* is unknown. One standard way in which to estimate 9* is via 
maximum likelihood (possibly with an additional regularization term); in this particular exponential 
family setting, it is straightforward to show that the (normalized) log likelihood takes the form 

e(0) = (/2", 0) - A{9) - y'R{9) (23) 

where function i? is a regularization term with an associated (possibly data-dependent) weight 
A". The quantities /I" := ^ Y17=i 4'{-^^) are the empirical moments defined by the data. For the 
indicator-based exponential representation @, these empirical moments correspond to a set of 
singleton and pairwise marginal distributions, denoted /i" and Ji% respectively. 

It is intractable to maximize the regularized likelihood directly, due to the presence of the 
cumulant generating function A. Thus, a natural thought is to use the convex surrogate B to 
define an alternative estimator obtained by maximizing the regularized surrogate likelihood: 

ieie) ■■= {r,e)-B{9)-\''R{9). (24) 

By design, the surrogate B and hence the surrogate likelihood is-, as well as their derivatives, can 
be computed in a straightforward manner (typically by some sort of message-passing algorithm). 
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It is thus straightforward to compute the parameter 0" achieving the maximum of the regularized 
surrogate hkehhood (for instance, gradient descent would a simple though naive method). 

For the tree-reweighted Bethe surrogate ()2Up . we have shown in previous work i28j that in 
the absence of regularization, the optimal parameter estimates 6"' have a very simple closed-form 
solution, specified in terms of the weights pst and the empirical marginals /I. (We make use of this 
closed form in our experimental comparison in Sectional see equation (|47)) .) If a regularizing term 
is added, these estimates no longer have a closed-form solution, but the optimization problem H24() 
can still be solved efficiently using the tree-reweighted sum-product algorithm |28\ 1291 . 

4.2 Joint estimation and prediction 

Using such an estimator, we now consider a joint approach to estimation and prediction. Recalling 
the basic set-up, we are given an initial set of i.i.d. samples {x^, . . . , x"} fromp(- ; 6*), where the true 
model parameter 9* is unknown. These samples are used to form an estimate of the Markov random 
field. We are then given a noisy observation y of a new sample z ~ p(- ; 6*), and the goal is to use 
this observation in conjunction with the fitted model to form a near-optimal estimate of z. The key 
point is that the same convex surrogate B is used both in forming the surrogate likelihood (|24|) for 
approximate parameter estimation, and in the variational method (|16j) for performing prediction. 

For a given fitted model parameter 6 £ M.'^, the central object in performing prediction is the 
posterior distribution p(z \ y; 6) p{z ; 9) p{y \ z). In the exponential family setting, for a fixed 
noisy observation y, this posterior can always be written as a new exponential family member, 
described by parameter 6 + 7(y). (Here the term 7(y) serves to incorporate the effect of the noisy 
observation.) With this set-up, the procedure consists of the following steps: 



Joint estimation and prediction: 

1. Form an approximate parameter estimate 0" from an initial set of i.i.d. data 
{x^, . . . by maximizing the (regularized) surrogate likelihood £b- 

2. Given a new noisy observation y (i.e., a contaminated version of z ~ p{- ; 0*)) specified 
by a factorized conditional distribution of the form p{y \ z) = Y['I=iPiys I ^s), incorpo- 
rate it into the model by forming the new exponential parameter 

^"(•)+7s(y) (25) 

where 7s (y) merges the new data with the fitted model 9"'. (The specific form of 7 
depends on the observation model.) 

3. Using the message-passing algorithm associated with the convex surrogate B, compute 
approximate marginals r(0 + 7) for the distribution that combines the fitted model 
with the new observation. Use these approximate marginals to construct a prediction 
z{y; t) of z based on the observation y and pseudomarginals r. 



Examples of the prediction task in the final step include smoothing (e.g., denoising of a noisy 
image) and interpolation (e.g., in the presence of missing data). We provide a concrete illustration 
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of such a prediction problem in Section El using a mixture-of-Gaussians observation model. The 
most important property of this joint scheme is that the convex surrogate B underhes both the 
parameter estimation phase (used to form the surrrogate hkehhood), and the prediction phase (used 
in the variational method for computing approximate marginals). It is this matching property that 
will be shown to be beneficial in terms of overall performance. 

5 Analysis 

In this section, we turn to the analysis of the surrogate-based method for estimation and prediction. 
We begin by exploring the asymptotic behavior of the parameter estimator. We then prove a 
Lipschitz stability result applicable to any variational method that is based on a strongly concave 
entropy approximation. This stability result plays a central role in our subsequent development of 
bounds on the performance loss in Sectional 

5.1 Estimator asymptotics 

We begin by considering the asymptotic behavior of the parameter estimator 0" defined by the 
surrogate likelihood p4|) . Since this parameter estimator is a particular type of M-estimator, its 
asymptotic behavior can be investigated using standard methods, as summarized in the following: 

Proposition 2. Let B he a strictly convex surrogate to the cumulant generating function, defined 
via equation (|16() with a strictly concave entropy approximation —B*. Consider the sequence of 
parameter estimates {^"} given by 

r := argmax{(/2", ^)-S(^)- A"i?(^)} (26) 

where R is a non-negative and convex regularizer, and the regularization parameter satisfies A" = 

Then for a general graph with cycles, the following results hold: 
(a) we have 9^ — ^ 6, where 6 is (in general) distinct from the true parameter 9*. 
(h) the estimator is asymptotically normal: 

^ n(^, {V^B{9))-^V'^A{9*){V'^B{9))~^^ 

Proof. By construction, the convex surrogate B and the (negative) entropy approximation B* are 
a Fenchel-Legendre conjugate dual pair. From Proposition the surrogate B is differentiable. 
Moreover, the strict convexity of B and B* ensure that the gradient mapping Vi3 is one-to-one 
and onto the relative interior of the constraint set REL(^(G) (see Section 26 of Rockafellar I19j). 
Moreover, the inverse mapping (Vi?)~^ exists, and is given by the dual gradient VB*. 

Let n* be the moment parameters associated with the true distribution 9* (i.e., fi* = Eg* [(^(X)]). 
In the limit of infinite data, the asymptotic value of the parameter estimate is defined by 

VB(9)=fi*. (27) 
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Note that /i* belongs to the relative interior of MARG0(G), and hence to the relative interior of 
REL<^(G). Therefore, equation (P7|) has a unique solution 6 = V~^i?(/i*). 

By strict convexity, the regularized surrogate likelihood (|26|) has a unique global maximum. 
Let us consider the optimality conditions defining this unique maximum O"^; they are given by 
VB{6"') = /i" — X^dR{6^), where dR{6^) denotes an arbitrary element of the subdifferential of the 
convex function R at the point 9^. We can now write 



Taking inner products with the difference 9"" — 9 yields 



(a) 

< 



B( 



VB{9) 



.*iT 



+ y'dR{9 



n\T 



(28) 



(29) 



where inequality (a) follows from the convexity of B. From the convexity and non- negativity of R, 
we have 



X''dR{9 



n\T 



< y 



R{9)-R{9'^) < X''R{9). 



Applying this inequality and Cauchy-Schwartz to equation ()29|) yields 



< 



5(r) - VB{9) 



< 



(30) 



Since A" 
quantity 



o(l) by assumption and 

T 



Op(l) by the weak law of large numbers, the 



B{9'')-VB{9) 



converges in probability to zero. By the strict convexity of 



B, this fact implies that ^" converges in probability to 9, thereby completing the proof of part (a). 

To establish part (b), we observe that ^[fl"- - fi*] N{0,V^A{9*)) by the central limit 
theorem. Using this fact and applying the delta method to equation H28|) yields that 



^/EV^B{9) 

where we have used the fact that ^/nX"" 
is invertible, so that claim (b) follows. 



N {0,V^A{9*)) , 

o(l). The strict convexity of B guarantees that V'^B{9) 

□ 



A key property of the estimator is its inconsistency — i.e., the estimated model differs from the 
true model 9* even in the limit of large data. Despite this inconsistency, we will see that the 
approximate parameter estimates 0" are nonetheless useful for performing prediction. 



5.2 Algorithmic stability 

A desirable property of any algorithm — particularly one applied to statistical data — is that it 
exhibit an appropriate form of stability with respect to its inputs. Not all message-passing al- 
gorithms have such stability properties. For instance, the standard sum-product message-passing 
algorithm, although stable for relatively weakly coupled MRFs 124j , can be highly unstable in 
other regimes due to the appearance of multiple local optima in the non-convex Bethe problem. 
However, previous experimental work has shown that methods based on convex relaxations, in- 
cluding the reweighted sum-product (or belief propagation) algorithm [2Hj , reweighted generalized 
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BP pi], and log-determinant relaxations appear to be very stable. For instance, Figure [^pro- 
vides a simple illustration of the instability of the ordinary sum-product algorithm, contrasted with 
the stability of the tree-reweighted updates. Wiegerinck j34| provides similar results for reweighted 
forms of the generalized belief propagation. Here we provide theoretical support for these empirical 



Grid with attractive coupiing 




0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 
Coupiing strenglii 



Figure 2. Contrast of the instability of the ordinary sum-product algorithm with the stability of the 
tree-reweighted version Results shown with a grid with N = 100 nodes over a range of attractive 
coupling strengths. The ordinary sum-product undergoes a phase transition, after which the quality 
of marginal approximations degrades substantially. The tree-reweighted algorithm, shown for two 
different settings of the edge weights pst, remains stable over the full range of coupling strengths. 
See Wainwright et al. [^Hl for full details. 

observations: in particular, we prove that, in sharp contrast to non-convex methods, any variational 
method based on a strongly convex entropy approximation is globally stable. This stability prop- 
erty plays a fundamental role in providing a performance guarantee on joint estimation/prediction 
methods. 

We begin by noting that for a multinomial Markov random field the computation of the 
exact marginal probabilities is a globally Lipschitz operation: 

Lemma 2. For any discrete Markov random field there is a constant L < +oo such that 

\\n{e + 6) - fi{9)\\ < L\\6\\ for all 9,6 gR'^. (31) 

This lemma, which is proved in Appendix El guarantees that small changes in the problem 
parameters — that is, "perturbations" 6 — lead to correspondingly small changes in the computed 
marginals. 

Our goal is to establish analogous Lipschitz properties for variational methods. In particular, it 
turns out that any variational method based on a suitably concave entropy approximation satisfies 
such a stability condition. More precisely, a function / : ^ M is strongly convex if there exists 
a constant c > such that f{y) > f{x) + V/(x)^ {y — x) + ^ \\y — x|p for all x, y G M"-. For a twice 
continuously differentiable function, this condition is equivalent to having the eigenspectrum of the 
Hessian V^/(a;) be uniformly bounded below by c. With this definition, we have: 

Proposition 3. Consider any strictly convex surrogate B based on a strongly concave entropy 
approximation —B*. Then there exists a constant R < +oo such that 

\\t{9 + 6) - t{9)\\ < R\\6\\ for all 9,6 eR'^. 



13 



Proof. From Proposition^ we have t{8) = VB{6), so that the statement is equivalent to the 
assertion that the gradient Vi? is a Lipschitz function. Applying the mean value theorem to Vi?, 
we can write VB{6 + 6) — S/B{6) = V^B{9 + t6)S where t G [0,1]. Consequently, in order to 
establish the Lipschitz condition, it suffices to show that the spectral norm of \/'^B{-f) is uniformly 
bounded above over all 7 G M'^. Since B and B* are a strictly convex Legendre pair, we have 
V'^B{9) = [V'^B*{T{e))]~^. By the strong convexity of B* , we are guaranteed that the spectral 
norm of V^i?*(r) is uniformly bounded away from zero, which yields the claim. □ 

Many existing entropy approximations can be shown to be strongly concave. In Appendix lO 
we provide a detailed proof of this fact for the convexified Bethe entropy H19() . 

Lemma 3. For any set {pst} of strictly positive edge appearance probabilities, the convexified Bethe 
entropy (|19|) is strongly concave. 

We note that the same argument can be used to establish strong concavity for the reweighted 
Kikuchi approximations studied by Wiegerinck Moreover, it can be shown that the Gaussian- 
based log-determinant relaxation considered in Wainwright and Jordan PT] is also strongly con- 
cave. For all of these variational methods, then, Proposition |31 guarantees that the pseudomarginal 
computation is globally Lipschitz stable, thereby providing theoretical confirmation of previous 
experimental results [3 H l29 | l3T]. 

6 Performance bounds 

In this section, we develop theoretical bounds on the performance loss of our approximate approach 
to joint estimation and prediction, relative to the unattainable Bayes optimum. So as not to 
unnecessarily complicate the result, we focus on the performance loss in the infinite data limit^ 
(i.e., for which the number of samples n = +00). 

In the infinite data setting, the Bayes optimum is unattainable for two reasons: 

1. it is based on knowledge of the exact parameter 9*, which is not easy to obtain. 

2. it assumes (in the prediction phase) that computing exact marginal probabilities of the 
Markov random field is feasible. 

Of these two difficulties, it is the latter assumption — regarding the computation of marginal probabilities — 
that is the most serious. As discussed earlier, there do exist computationally tractable estimators of 
9* that are consistent though not statistically efficient under appropriate conditions; one example is 
the pseudolikelihood method mentioned previously. On the other hand, MCMC methods may 
be used to generate stochastic approximations to marginal probabilities, but may require greater 
than polynomial complexity. 

Recall from Proposition |21 that the parameter estimator based on the surrogate likelihood Ib 
is inconsistent, in the sense that the parameter vector 9 returned in the limit of infinite data is 
generally not equal to the true parameter 9*. Our analysis in this section will demonstrate that 
this inconsistency is beneficial. 

^Note, however, that modified forms of the resuhs given here, modulo the usual 0{l/n) corrections, hold for the 
finite data setting. 
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6.1 Problem set-up 



Although the ideas and techniques described here are more generally applicable, we focus here on 
a special observation model so as to obtain a concrete result. 



Observation model: In particular, we assume that the multinomial random vector X = {Xg, s G 
V} defined by the Markov random field (jlj is a label vector for the components in a finite mix- 
ture of Gaussians. For each node s S we specify a new random variable Zg by the conditional 
distribution 

p{Zs = Zs\Xs= j) ~ N{uj,a'j) for j £ {0,1, . . . ,m - 1}, 

so that Zg is a mixture of m Gaussians. Such Gaussian mixture models are widely used in spatial 
statistics as well as statistical signal and image processing |S1 El 121] ■ 

Now suppose that we observe a noise-corrupted version of Zg — namely, a vector Y of observa- 
tions with components of the form 

Ys = aZs + ^l-a'^Ws, (32) 

where Wg ~ -^(0, 1) is additive Gaussian noise, and the parameter a G [0, 1] specifies the signal-to- 
noise ratio (SNR) of the observation model. Note that a = corresponds to pure noise, whereas 
a = 1 corresponds to completely uncorrupted observations. 



Optimal prediction: Our goal is to compute an optimal estimate z{y) of z as a function of 
the observation Y = y, using the mean-squared error as the risk function. The essential object 
in this computation is the posterior distribution p{x \ y; 6*) = p{x; 0*)p{y \ x), where the 
conditional distribution p{y \ x) is defined by the observation model H32() . As shown in the sequel, 
the posterior distribution (with y fixed) can be expressed as an exponential family member of the 
form 6* +7(2/) (see equation H39a() ). Disregarding computational cost, it is straightforward to show 
that the optimal Bayes least squares estimator (BLSE) takes the form 



z°gP\Y;e*) 



Y^^,gij;9*+j(Y)) 



j=0 



ujj{a){Ys - Vj) + Vj 



(33) 



where fis{j',d* + j) denotes the marginal probability associated with the posterior distribution 
p{x ; 0* + 7), and 



ujj{a) :-- 



a^a'] + (1 - q2) 



(34) 



is the usual BLSE weighting for a Gaussian with variance cr| 



Approximate prediction: Since the marginal distributions /is(j; Q* +7) are intractable to com- 
pute exactly, it is natural to consider an approximate predictor, based on a set r of pseudomarginals 
computed from a variational relaxation. More explicitly, we run the variational algorithm on the 
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parameter vector + 7 that is obtained by combining the new observation y with the fitted model 
6, and use the outputted pseudomarginals Ts(-; + 7) as weights in the approximate predictor 



m— 1 

^pp(y;^) := Y.rs{r,e + ^{Y)) 

j=0 



ijj 



'j{a){Ys-Uj) +Vj 



(35) 



where the weights oo are defined in equation (|34|) . 

We now turn to a comparison of the Bayes least-squares estimator (BLSE) defined in equa- 
tion (|33|) to the surrogate-based predictor (|35j) . Since (by definition) the BLSE is optimal for 
the mean-squared error (MSE), using the surrogate-based predictor will necessarily lead to a larger 
MSE. Our goal is to prove an upper bound on the maximal possible increase in this MSE, where the 
bound is specified in terms of the underlying model 6* and the SNR parameter a. More specifically, 
for a given problem, we define the mean-squared errors 

R^PP(a, r ) := ^E||2°P'(y; 6*) - Zf, and R°P*(a, 9) := ^E||J^PP(y; 6) - Zf, (36) 

of the Bayes-optimal and surrogate-based predictors, respectively. We seek upper bounds on the 
increase AR(a, 0*,0) := R^PP(a,^) — R°P*(a,0*) of the approximate predictor relative to Bayes 
optimum. 



6.2 Role of stability 

Before providing a technical statement and proof, we begin with some intuition underlying the 
bounds, and the role of Lipschitz stability. First, consider the low SNR regime (a « 0) in which 
the observation Y is heavily corrupted by noise. In the limit a = 0, the new observations are pure 
noise, so that the prediction of Z should be based simply on the estimated model — namely, the 
true model p{- ; 9*) in the Bayes optimal case, and the "incorrect" model p{- ; 9) for the method 
based on surrogate likelihood. The key point here is the following: by properties of the MLE and 
surrogate-based estimator, the following equalities hold: 

VA{9*) ^JL{9*) /X* t(9) VB{9). (37) 

Here equality (a) follows from Lemma ^ whereas equality (b) follows from the moment-matching 
property of the MLE in exponential families. Equalities (c) and (d) hold from the Proposition^ 
and the pseudomoment-matching property of the surrogate-based parameter estimator (see proof 
of Proposition 12) . As a key consequence, it follows that the combination of surrogate-based es- 
timation and prediction is functionally indistinguishable from the Bayes-optimal behavior in the 
limit of a = 0. More specifically, in the limiting case, the errors systematically introduced by the 
inconsistent learning procedure are cancelled out exactly by the approximate variational method for 
computing marginal distributions. Of course, exactness for a = is of limited interest; however, 
when combined with the Lipschitz stability ensured by Proposition |31 it allows us to gain good 
control of the low SNR regime. At the other extreme of high SNR (a ~ 1), the observations are 
nearly perfect, and hence dominate the behavior of the optimal estimator. More precisely, for a 
close to 1, we have ujj{a) 1 for all j = 0, 1, . . . , m - 1, so that ^p\Y;9*) ^Y^ TP'p{Y;9). 
Consequently, in the high SNR regime, accuracy of the marginal computation has little effect on 
the accuracy of the predictor. 
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6.3 Bound on performance loss 

Although bounds of this nature can be developed in more generality, for simplicity in notation we 
focus here on the case of m = 2 mixture components. We begin by introducing the factors that 
play a role in our bound on the performance loss AR(a, 6* ,9). First, the Lipschitz stability enters 
in the form of the quantity: 

L{e*;9) := sup a^^^{V^ A{e* + 6) - B {9 + 5)) , (38) 

where cTmax denotes the maximal singular value. Following the argument in the proof of Proposi- 
tion|31 it can be seen that L{9*;9) is finite. 

Second, in order to apply the Lipschitz stability result, it is convenient to express the effect 
of introducing a new observation vector y, drawn from the additive noise observation model ()32() . 
as a perturbation of the exponential parameterization. In particular, for any parameter 9 E R*^ 
and observation y from the model H32|) . the conditional distribution p{x \ y; 9) can be expressed as 
p{x; 9 + 7(y, a)), where the exponential parameter 7(7/, a) has components 

if aVg + (1 - g^) {ys - aupf {ys - avif \ ^ t/ /on ^ 

= 2 I aVf + (1 - a^) + aVg + (1 - a^) ' a^a\ + (1 - a^) / ^^^^-(39^) 

1st = y{s,t)eE. (39b) 

See AppendixOfor a derivation of these relations. 

Third, it is convenient to have short notation for the Gaussian estimators of each mixture 
component: 

gj{Y,; a) := ujj{a) {Y, - uj) + uj, for j = 0, 1 (40) 
With this notation, we have the following 
Theorem 1. The MSE increase AR(a,0*,6') := R(q,0) - R(a,6'*) is upper hounded by 

ARia,9*,9) < Ejmin^l, L(r ; ^) ^1 - goC^.)^ | _ (4^) 

Before proving the bound (|1T|| . we begin by considering its behavior in a few special cases. 

Effect of SNR: First, consider the low SNR limit in which q — > 0^. In this limit, it can be seen 
that ||7(y; q;)|| — > 0, so that the the overall bound AR(a) tends to zero. Similarly, in the high SNR 
limit as a ^ 1~, we see that ujj{a) — > 1 for j = 0, 1, which drives the differences \gi{Ys) — go{Ys)\, 
and in turn the overall bound A R(a) to zero. Thus, the surrogate-based method is optimal in both 
the low and high SNR regimes; its behavior in the intermediate regime is governed by the balance 
between these two terms. 
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Effect of equal variances: Now consider tlie special case of equal variances a'^ = aQ = af, 
in which case io{a) = uJo{a) = LOi{a). Thus, the difference gi{Ys,a) — go{Ys,a) simplifies to 
(1 — u;{a)) (i^i — uq), so that the bound (|^T|) reduces to 



AR{a,e*,e) < (l-a;(a))2(z.i-z.o)2E<|min(l, !>. (42) 



As shown by the simpler expression H42|). for i^i ^ vq, the MSE increase is very small, since such a 
two-component mixture is close to a pure Gaussian. 

Effect of mean difference: Finally consider the case of equal means v = = ui \n the two 
Gaussian mixture components. In this case, we have giiYs, a) — g^iYs, a) = [a'i(a) —ujQ{a)\ \Ys — v], 
so that the bound (|^T|) reduces to 



Here the MSE increase depends on the SNR a and the difference 

aaf ao-Q _ (1 - a^) {aj - ctq) 



uJi{a) - cJo(a) 



Observe, in particular, that the MSE increases tends to zero as the difference — ctq decreases. 
6.4 Proof of Theorem [T] 

By the Pythagorean relation that characterizes the Bayes least squares estimator z*{Y; /x), we have 
AR{a;e*,e) := ^E||2^PP(y; - - ^E||z°P*(y; r ) - 
= ^E\\T'PP{Y;9)-T^\Y;9*)\\l 
Using the definitions of z^PP(y;0) and z°P*(y;^*), some algebraic manipulation yields 
'z^P^Y;9)-r,P\Y;e*)Y = 7) " + 7)1 ' - 5o(n)]' 



< 



Tsi9 + 7) - f^sie* + 7) bi(n) - goiYsW , 



where the second inequality uses the fact that |ts— < 1 since and fis are marginal probabilities. 
Next we write 



1 1^1 

-\\z-^r>^Y;9)-r^'{Y;e*)\\l < - Y;^\r,(9 + j) - f,{9* + j) [g^{Ys) - go{Ys)] 



s=l 
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where the last hne uses the Cauchy-Schwarz inequahty. 

It remains to bound the 2-norm \\t{0 + 7) — fJi{9* + 7)||2- An initial naive bound follows from 
the fact Ts,/is G [0, 1] implies that \ts — ^ whence 



1 



|t-/x||2< 1. 



(45) 



An alternative bound, which will be better for small perturbations 7, can be obtained as follows. 
Using the relation t{0) = fi{6*) guaranteed by the definition of the ML estimator and surrogate 
estimator, we have 



\Ti9 + ^)-^,{0* + ^)\\, 



r{0 + ^)-T{e)\+[|,{e*)-^l{e*+^)] 

V^B(9 + 57) - V'^A{9* + t-f) 



7 



for some s,t £ [0, 1], where we have used the mean value theorem. Thus, using the definition (jSHJ 
of L, we have 



Combining the bounds ()45() and H46|) and applying them to equation 1)44(1 . we obtain 



(46) 



l\\r'^P{Y;9)-^P\Y;e*)\\l < min\l, L{e*;i 



hiY;a) 



N 



Es=i\9i{Ys) - gojYs. 
N 



Taking expectations of both sides yields the result. 



□ 



7 Experimental results 

In order to test our joint estimation/prediction procedure, we have applied it to coupled Gaussian 
mixture models on different graphs, coupling strengths, observation SNRs, and mixture distribu- 
tions. Here we describe both experimental results to quantify the performance loss of the tree- 
reweighted sum-product algorithm j^H], and compare it to both a baseline independence model, as 
well as a closely related heuristic method that uses the ordinary sum-product (or belief propagation) 
algorithm. 

7.1 Methods 

In Section we described a generic procedure for joint estimation and prediction. Here we begin 
by describing the special case of this procedure when the underlying variational method is the 
tree-reweighted sum-product algorithm . Any instantiation of the tree-reweighted sum-product 
algorithm is specified by a collection of edge weights pst, one for each edge (s,t) of the graph. 
The vector of edge weights must belong to the spanning tree polytope; see Wainwright et al. j^H] 
for further background on these weights and the reweighted algorithm. Given a fixed set of edge 
weights p, the joint procedure based on the tree-reweighted sum-product algorithm consists of the 
following steps: 
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1. Given an initial set of i.i.d. data we first compute the empirical marginal 
distributions 

^^sU) ■•=-^i[xi= j], jistU, k) ■.= -Y^i[xi= j] I [XI = k], 

i=\ 1=1 
and use them to compute the approximate parameter estimate 

:= log%{j), e^U) := pst log J'f^^-l^y (47) 

As shown in our previous work |2H] , the estimates ()47() are the global maxima of the surrogate 
likelihood ()24() based on the convexified Bethe approximation H19|) without any regularization 
term (i.e., R = 0). 

2. Given the new noisy observation Y of the form H32|l. we incorporate it by by forming the new 
exponential parameter 



where equation (|39a|) defines 7^ for the Gaussian mixture model under consideration. 

3. We then compute approximate marginals t{6 + 7) by running the TRW sum-product algo- 
rithm with edge appearance weights pst, using the message updates H21() . on the graphical 
model distribution with exponential parameter 9 + j. We use the approximate marginals (see 
equation (|22|) 1 to construct the prediction S^pp in equation (|35|) . 

We evaluated the tree-reweighted sum-product based on its increase in mean-squared error 
(MSE) over the Bayes optimal predictor (|33|) . Moreover, we compared the performance of the 
tree-reweighted approach to the following alternatives: 

(a) As a baseline, we used the independence model in which the mixture distributions at each node 
are all assumed to be independent. In this case, ML estimates of the parameters are given by 
Osixs) = log/Us(xs), with all of the coupling terms 9st{xs-,xt) equal to zero. The prediction 
step reduces to computing the Bayes least squares estimate at each node independently, based 
only on the local data Hs- 

(b) The standard sum-product or belief propagation (BP) approach is closely related to the tree- 
reweighted sum-product method, but based on the edge weights pst = 1 for all edges. In 
particular, we first form the approximate parameter estimate 6 using equation H47() with 
Pst = 1. As shown in our previous work |2HI, this approximate parameter estimate uniquely 
defines the Markov random field for which the empirical marginals fig and p^st are fixed points 
of the ordinary belief propagation algorithm. We note that a parameter estimator of this 
type has been used previously by other researchers ;8;, 2tS|. In the prediction step, we then use 
the ordinary belief propagation algorithm (i.e., again with pat = 1) to compute approximate 
marginals of the graphical model with parameter 6 + "f. Finally, based on these approximate 
BP marginals, we compute the approximate predictor using equation (|35() . 
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7.2 Comparisons 

Although our methods are more generally applicable, here we show representative results for m = 2 
components, and two different types of Gaussian mixtures. 

(a) Mixture ensemble A is bimodal, with components (z^QiCTo) ~ (~li0-5) and (z^i,cr^) = (1,0.5). 

(b) Mixture ensemble B was constructed with mean and variance components (z/o,crg) = (0,1) 
and (z^i,cr^) = (0,9); these choices serve to mimic heavy-tailed behavior. 

In both cases, each mixture component is equally weighted; see Figure IHl for histograms of the 
resulting mixture ensembles. 





(a) 



(b) 



Figure 3. Histograms of different Gaussian mixture ensembles, (a) Ensemble A: a bimodal ensemble 
with {vojCTq) — (—1,0.5) and (i^i,cr^) = (1,0.5). (b) Ensemble B: mimics a heavy-tailed distribution, 
with (z/o,a2) = (0,1) and {i^i,af) - (0,9). 

Here we show results for a 2-D grid with = 64 nodes. Since the mixture variables have m = 2 
states, the coupling distribution can be written as 

p{x; 9*) (xe^p{J2^*s^s+ ^ 9liXsXt}, 

seV {s,t)£E 

where x G {—1, +1}^ are "spin" variables indexing the mixture components. In all trials, we chose 
0* = for all nodes s V, which ensures uniform marginal distributions p{xs ; 9*) = [0.5 0.5]"^ at 
each node. We tested two types of coupling in the underlying Markov random field: 

(a) In the case of attractive coupling, for each coupling strength 7 G [0, 1], we chose edge param- 
eters as 9*f ~ h([0, 7]. 

(b) In the case of mixed coupling, for each coupling strength 7 £ [0, 1], we chose edge parameters 
as 6**4 [-7, 7]. 

Here U[a,b] denotes a uniform distribution on the interval [a,b]. In all cases, we varied the SNR 
parameter a, as specified in the observation model (|32|) . in the interval [0, 1]. 

Shown in Figure 0] are 2-D surface plots of the average percentage increase in MSE, taken over 
100 trials, as a function of the coupling strength 7 G [0, 1] and the observation SNR parameter 
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a £ [0, 1] for the independence model (left column), BP approach (middle column) and TRW 
method (right column). The top two rows show performance for attractive coupling, for mixture 
ensemble A ((a) through (c)) and ensemble B ((d) through (f)), whereas the bottom two row show 
performance for mixed coupling, for mixture ensemble A ((g) through (i)) and ensemble B ((j) 
through (1)). 

First, observe that for weakly coupled problems (7 ~ 0), whether attractive or mixed cou- 
pling, all three methods — including the independence model — perform quite well, as should be 
expected given the weak dependency between different nodes in the Markov random field. Al- 
though not clear in these plots, the standard BP method outperforms the TRW-based method 
for weak coupling; however, both methods lose than than 1% in this regime. As the coupling is 
increased, the BP method eventually deteriorates quite seriously; indeed, for large enough coupling 
and low/intermediate SNR, its performance can be worse than the independence (IND) model. 
This deterioration is particularly severe for the case of mixture ensemble A with attractive cou- 
pling, where the percentage loss in BP can be as high as 50%. Looking at alternative models (in 
which phase transitions are known), we have found that this type of rapid degradation coincides 
with the appearance of multiple fixed points. In contrast, the behavior of the TRW method is 
extremely stable, which is consistent with our theoretical results. 

7.3 Comparison between theory and practice 

We now compare the practical behavior of the tree-reweighted sum-product algorithm to the the- 
oretical predictions from Theorem ^ In general, we have found that in quantitative terms, the 
bounds ()41() are rather conservative — in particular, the TRW sum-product method performs much 
better than the bounds would predict. However, here we show how the bounds can capture quali- 
tative aspects of the MSE increase in different regimes. 

Figure |S1 provides plots of the actual MSE increase for the TRW algorithm (dotted blue lines) , 
compared to the theoretical bound (|4H) (solid red lines), for the grid with = 64 nodes, and 
attractive coupling of strength 7 = 0.70. For all comparisons in both panels, we used L = 0.10, 
which numerical calculations showed to be a reasonable choice for this coupling strength. (Overall, 
changes in the constant L primarily cause the bounds to shift up and down on the log scale, and 
so do not overly affect the qualitative comparisons given here.) Panel (a) provides the comparison 
ensembles of type A, with fixed variances cjg = cr^ = 0.5 and mean vectors (1/0,^^1) ranging from 
(—0.5,0.5) to (—2.5,2.5). Note how the bounds capture the qualitative behavior for low SNR, for 
which the difficulty of the problem increases as the mean separation is increased. In contrast, in the 
high SNR regime, the bounds are extremely conservative, and fail to predict that the sharp drop-off 
in error as the SNR parameter a approaches one. This drop-off is particularly pronounced for the 
ensemble with largest mean separation (marked with +). Panel (b) provides a similar comparison 
for ensembles of type B, with fixed mean vectors uq = vi = 0, and variances (crg,0"f) ranging from 
(1,1.25) to (1,25). In this case, although the bounds are still very conservative in quantitative 
terms, they reasonably capture the qualitative behavior of the error over the full range of SNR. 

8 Discussion 

Key challenges in the application of Markov random fields include the estimation (learning) of 
model parameters, and performing prediction using noisy samples (e.g., smoothing, interpolation. 
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Figure 4. Surface plots of the percentage increase in MSE relative to Bayes optimum for different 
methods as a function of observation SNR for grids with = 64 nodes. Left column: independence 
model (IND). Center column: ordinary belief propagation (BP). Right column: tree-reweighted 
algorithm (TRW). First row: Attractive coupling and a Gaussian mixture with components (fo, ctq) ~ 
(—1,0.5) and {vi,af) = (1,0.5). Second row: Attractive coupling and a Gaussian mixture with 
components {i^o,aQ) — (0,1) and {i^o,af) — (0,9). Third row: Mixed couphng and a Gaussian 
mixture with components (i^oj^o) = (—1,0.5) and {i/ijaf) = (1,0.5). Fourth row: Mixed coupling 
and a Gaussian mixture with components {i^o, CTq) — (0, 1) and (I'o, ctj) = (0, 9). 
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Figure 5. Comparison of actual MSE increase and upper bounds for grid with = 64 nodes with 
attractive couphng. (a) Equal variances = — 0.5, and mean vectors {vo,vi) ranging from 
(—0.5,0.5) to (—2.5,2.5). (b) Equal mean vectors i^o = i^i = 0, and variances {(jQ,(yf) ranging from 
(1,1.25) to (1,25). 



denoising) . Both of these problems present substantial computational challenges for general Markov 
random fields. In this paper, we have described and analyzed methods for joint estimation and 
prediction that are based on convex variational methods. Our central result is that using inconsis- 
tent parameter estimators can be beneficial in the computation-limited setting. Indeed, our results 
provide rigorous confirmation of the fact that using parameter estimates that are "systematically 
incorrect" is helpful in offsetting the error introduced by using an approximate method during the 
prediction step. In concrete terms, we demonstrated that a joint prediction/estimation method 
using the tree-reweighted sum-product algorithm yields good performance across a wide range 
of experimental conditions. Although our work has focused on a particular scenario, we suspect 
that similar ideas and techniques will be useful in related applications of approximate methods for 
learning and prediction. 
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A Tree-based relaxation 

As an illustration on the single cycle on 3 vertices, the pseudomarginal vector with elements 

for s = 1,2,3 and Tst{xs,xt) 



T~<i \ X a 



0.5 
0.5 



Ust 0.5 - Ost 
0.5 - ast OLst 



belongs to LOCAL(^(G) for all choices Ugt £ [0, 0.5], but fails to belong to MARG0(G), for instance, 
when ai2 = 023 = ai3 = 0. 
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B Proof of Lemma [2] 



Using Lemma Hand the mean value theorem, we write 

n{9 + 6)-n{e) = vA{e + 6)-vA{e) 

= V^A{9 + t6)6 

for some t G (0, 1). Hence, it suffices to show that the eigenspectrum of the Hessian \7'^A{9) = cove{<^(X)} 
is uniformly bounded above by L < +00. The functions (p are all 0-1 valued indicator functions, so 
that the diagonal elements of covg are bounded above — in particular, vaT{(pa{X)) < ^ for 

any index a £ {1, ... ,d}. Consequently, we have 

d ^ 

Amax(cove{0(X)}) < ^ Aa(cov6){(/)(X)} = trace(cove{0(X)} = - 

a=l 

as required. 

C Proof of Lemma [HI 

Consider a spanning tree T of G with edge set E{T). Given a vector r S LOCAL(^(G), we associate 
with T a subvector t(T) formed by those components of r associated with vertices V and edges 
E{T). Note that by construction t(T) G LOCAL^(r) = MARG<^(T). The mapping r ^ t{T) can 
be represented by a projection matrix H^ G M'^(^)^'^ with the block structure 

:= [ld(T)xd(T) Orf(T)x(d-d(T))] • 

In this definition, we are assuming for convenience that r is ordered such that the d{T) components 
corresponding to the tree T are placed first. With this notation, we have n^r = [t{T) O] . 

By our construction of the function Bp, there exists a probability distribution p := {p(T) | T G T} 
such that Bp{T) = ^'r(z<xP{T)A*{T{T)), where A*{t{T)) denotes the negative entropy of the tree- 
structured distribution defined by the vector of marginals t(T). Hence, the Hessian of Bp has the 
decomposition 

v'Bpir) = J]p(r)(n^)'v2^*(r(r))(n^) (48) 

Tel 

To check dimensions of the various quantities, note that V^A*(r(T)) is a d{T) x d{T) matrix, and 
recah that each matrix H^ G Mfi(^)x'^. 

Now by Lemma El the eigenvalues of the V'^A are uniformly bounded above; hence, the eigen- 
values of V'^A* are uniformly bounded away from zero. Hence, for each tree T, there exists a 
constant Ct such that for all z G M*^ 

z'{U^yV^A*{T{T)){U^)z > CrWU^zf = Cr||z{T}f. 

Substituting this relation into our decomposition (|48() and expanding the sum over T yields 

z'V'Bp{T)z > Y.p{T)CT\\z{T}f 
Tel 

= [Y^p{T)CT]Y^\\z{s}f+ [Y.P(T)CTms,t)eE{T)]]\\z{{s,t)}f. (49) 

Tel sev {s,t)eE Tel 
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Defining C* := minj-gx C't; we have the lower bounds 

[^p{T)Ct] > C*Y,p{T) = C* > 

p{T)CtI [{s, t) G E{T)] > C*T e 1p{T)I [{s, t) G E{T)] = C*pst > C*p* > 0, 

Tel 

where p* := min pst > 0. Applying these bounds to equation (|49() yields the final inequality 

{s,t)eE 

z'V'^Bp{T)z > C*p*\\zf VzeM'^ (50) 
with C*p* > 0, which establishes that the eigenvalues of V'^Bp{T) are bounded away from zero. 

D Form of exponential parameter 



Consider the observation model ys = azs + Vl — Oi'^Vg, where Vg ~ -^(0, 1) and Zs is a mixture of 
two Gaussians (i^OjO'o) (i^i,c7f). Conditioned on the value of the mixing indicator Xs = j, the 
distribution of Us is Gaussian with mean auj and variance a^cr| + (1 — a^). 

Let us focus on one component p{ys \ Xs) in the factorized conditional distribution p(y \ x) = 
nr=iP(ys I Xs)- For j = 0, 1, it has the form 

piys\Xs=j) = ^ . ^^P| ~ of 2 2 //I ^iVs -aiyj?}. (51) 

/27r[aV2 + (l-a2)] 2 [aV| + (1 - a^)] J 



We wish to represent the influence of this term on Xg in the form exp(7sXs) for some exponential 
parameter 7s. We see that 7^ should have the form 

Is = ^ogp{ys\Xs = 1) -logp{ys\Xs = 0) 

1 [a^f^o + (1 - "^)] , {ys-auof {y^ - au^)^ 



2 ^ [aVf + (1 - q2)] 2 [a^a^ + (1 - a^)] 2 [aVf + (1 - q2)] 
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